METHOD FOR DETECTING TARGET SOUND, METHOD FOR DETECTING DELAY TIME
IN SIGNAL INPUT, AND SOUND SIGNAL PROCESSOR

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method for detecting a target
sound and a program therefor, a method for detecting a delay time in
signal input between sound signals input into plural microphones and
a program therefor, a sound signal processor for processing input
sound signals, and a voice recognition device for detecting a speech
sound and performing voice recognition of the speech sound.

2. Description of the Related Art

Among the various forms of communication used by humans, voice is
the most basic and preferred form of communication, with an
information transmission speed higher than that of any other
information transmission method. Thus, voice has served as the basis
of human communication since ancient times.

Voice recognition techniques have been proposed for recognizing the
voice. Voice recognition includes extracting the most basic
information on the semantic contents, or phonological information,
from the information contained in the voice with a computer or other
data processing device, and determining the extracted contents. In
recent years, attempts have been made to apply such voice recognition
techniques as a man-machine interface in various fields, with the
drastic development of computer processor technology and the
construction of advanced information networks, typically the Internet.

The recognition performance of current voice recognition systems
has improved greatly with the utilization of probabilistic and
statistical schemes. In the case of voice in ideal environments and
voice collected at a short distance with a close-talking microphone,
a significantly high recognition rate is obtained.

However, when it comes to voice recognition in actual environments,
the recognition rate is inferior because of the mismatch between
learning data and observed data in their environments, contents of
speeches, and other factors. In addition, users suffer great burden
and discomfort from a close-talking microphone headset worn as a sound
reception system. This significantly hinders the practical
application of voice recognition systems.

Further, many studies have been conducted on voice recognition
methods using plural remote microphones for picking up remote voices.
However, such studies have shown that it is difficult to recognize
the remote voices because of their lower S/N ratio, influences of
background noise and room reverberation, and other factors. A typical
method uses a microphone array. This method can perform three types
of spatial signal processing, namely sound source position detection
processing, target sound emphasis processing, and noise suppression
processing. Remote voice recognition is being extensively researched
using methods such as the method described above.

However, this method requires plural microphones to be fixed at
regular intervals for accurate identification of the direction of the
speaker, and thus downsizing and mobilization of such a method are
difficult. Therefore, this method is difficult to apply to voice input
in various environments and under various circumstances, and thus has
limited uses.

As a mobile sound reception system enabling anytime/anywhere sound
input, there is an expectation that mountable microphones that can be
attached to clothes, glasses or other articles can be provided, which
(1) are compact and lightweight for easy mounting/removing, (2) can
ensure short-distance sound pickup generally similar to close-talking
microphones, and (3) can reduce the burden and discomfort to the user
when mounted as compared to close-talking microphone headsets.

SUMMARY OF THE INVENTION

To overcome the problems described above, preferred embodiments
of the present invention provide a method for detecting a target
sound, a method for detecting a delay time in signal input, a sound
signal processor, a voice recognition device, and programs therefor,
which enable the construction of a sound reception system that
includes plural mountable microphones and is highly resistant to
environmental fluctuations.

A method for detecting a target sound according to a preferred
embodiment of the present invention includes inputting detection
target sounds output from a detection target sound source into plural
microphones, detecting a phase of a cross-spectrum between sound
signals input into the plural microphones, detecting an inclination
of the phase of the cross-spectrum with respect to a frequency due to
respective distances from the detection target sound source to the
plural microphones, and, based on the inclination, determining whether
the sound input to the plural microphones includes the target sound.

The above method for detecting a target sound preferably includes
dividing the frequency into a plurality of bands, detecting the
inclination of the phase for each of the plurality of bands, and,
based on the detected inclinations of the phase of each of the
plurality of bands, determining whether the sound input into the
plural microphones includes the target sound.

The above method for detecting a target sound also preferably
includes detecting the target sound when the detected inclinations
of the plurality of bands are concentrated near a specific
inclination.
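The band-wise test described above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: the function names `band_slopes` and `is_target_sound`, the histogram bin width, and the 60% share threshold are all assumptions chosen for illustration. The sketch fits a least-squares slope to an already unwrapped phase curve in each band and reports a target sound when most band slopes fall into the same histogram bin.

```python
def band_slopes(phase, num_bands):
    """Least-squares slope of the unwrapped phase inside each frequency band."""
    width = len(phase) // num_bands  # bins per band (must be at least 2)
    slopes = []
    for b in range(num_bands):
        ks = list(range(b * width, (b + 1) * width))
        k_mean = sum(ks) / width
        p_mean = sum(phase[k] for k in ks) / width
        num = sum((k - k_mean) * (phase[k] - p_mean) for k in ks)
        den = sum((k - k_mean) ** 2 for k in ks)
        slopes.append(num / den)
    return slopes


def is_target_sound(phase, num_bands=8, bin_width=0.1, min_share=0.6):
    """Hypothetical rule: detect when enough band slopes share one histogram bin."""
    histogram = {}
    for s in band_slopes(phase, num_bands):
        key = round(s / bin_width)  # index of the nearest histogram bin
        histogram[key] = histogram.get(key, 0) + 1
    return max(histogram.values()) / num_bands >= min_share


# Phase with one common slope in every band (target-like).
linear_phase = [0.3 * k for k in range(64)]

# Piecewise-linear phase whose slope changes from band to band (noise-like).
mixed_phase, value = [], 0.0
for s in [0.5, -0.3, 0.1, -0.6, 0.4, -0.2, 0.7, -0.5]:
    for _ in range(8):
        mixed_phase.append(value)
        value += s

print(is_target_sound(linear_phase), is_target_sound(mixed_phase))
```

In this sketch the first phase curve is accepted because all eight band slopes land in one bin, while the second is rejected because every band has a different slope.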

The above method for detecting a target sound preferably includes
dividing the sound signals that are input into the plural microphones
into predetermined time sections, and detecting the phase of the
cross-spectrum between the sound signals in each time section.

A method for detecting a delay time in signal input according to
another preferred embodiment of the present invention includes
inputting sounds that are output from a sound source into plural
microphones, detecting a phase of a cross-spectrum between sound
signals that are input into the plural microphones, detecting an
inclination of the phase of the cross-spectrum with respect to a
frequency due to respective distances from the sound source to the
plural microphones, and, based on the inclination, determining the
delay time in signal input of the sounds input into the plural
microphones.
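As a rough illustration of this method, the following Python sketch estimates a delay from the slope of the cross-spectrum phase. It is a sketch under stated assumptions, not the embodiment itself: the helper names `dft`, `unwrap`, and `estimate_delay` are hypothetical, the delay is an integer number of samples applied circularly to a noise-free test signal, and a naive discrete Fourier transform stands in for whatever frequency analysis is actually used.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for short illustrative frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def unwrap(phases):
    """Remove 2*pi jumps so that consecutive phase values differ by less than pi."""
    out = [phases[0]]
    for p in phases[1:]:
        step = p - out[-1]
        step -= 2 * math.pi * round(step / (2 * math.pi))
        out.append(out[-1] + step)
    return out


def estimate_delay(x1, x2):
    """Delay of x2 relative to x1, in samples, from the cross-spectrum phase slope."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    half = n // 2
    phase = unwrap([cmath.phase(X1[k] * X2[k].conjugate()) for k in range(half + 1)])
    # Least-squares line through the origin: phase[k] is modelled as slope * k.
    slope = (sum(k * phase[k] for k in range(half + 1))
             / sum(k * k for k in range(half + 1)))
    return slope * n / (2 * math.pi)


n, true_delay = 64, 5
x1 = [0.9 ** t for t in range(n)]                   # broadband test signal
x2 = [x1[(t - true_delay) % n] for t in range(n)]   # circularly delayed copy
print(round(estimate_delay(x1, x2), 6))
```

The unwrapping step in this sketch only works while the phase advances by less than pi per frequency bin, that is, while the delay is shorter than half the frame length.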

The above method for detecting a delay time in signal input
preferably includes dividing the frequency into a plurality of bands,
detecting the inclination of the phase of each of the plurality of
bands, and, based on the detected inclinations of the phase of each
of the plurality of divided bands, determining the delay time.

The above method for detecting a delay time in signal input
preferably includes determining the delay time when the inclinations
of the plurality of bands are concentrated near a specific
inclination.




The above method for detecting a delay time in signal input
preferably includes dividing the sound signals input into the plural
microphones into predetermined time sections, and detecting the phase
of the cross-spectrum between the sound signals in each time section.

A sound signal processor according to another preferred embodiment
of the present invention includes a cross-spectrum phase detector for
detecting a phase of a cross-spectrum between sound signals input
into plural microphones, an inclination detector for detecting an
inclination of the phase of the cross-spectrum detected by the
cross-spectrum phase detector with respect to a frequency, and a
target sound detector for detecting whether the sound input into the
plural microphones includes a target sound based on the inclination
with respect to the frequency detected by the inclination detector.

The inclination detector of the sound signal processor preferably
divides the frequency of the phase of the cross-spectrum into a
plurality of bands and detects inclinations of each of the plurality
of bands, and the target sound detector detects whether the sound
input into the plural microphones includes the target sound based on
the inclination of each of the plurality of bands detected by the
inclination detector.

A sound signal processor for processing a sound output from a sound
source and input into plural microphones according to another
preferred embodiment of the present invention includes a
cross-spectrum phase detector for detecting a phase of a
cross-spectrum between sound signals input into the plural
microphones, an inclination detector for detecting an inclination of
the phase of the cross-spectrum detected by the cross-spectrum phase
detector with respect to a frequency, and a delay time detector for
detecting a delay time in the sound signals input into the plural
microphones based on the inclination with respect to the frequency
detected by the inclination detector.

The present preferred embodiment also preferably includes a sound
signal synthesizer for synthesizing the sound signals that are input
into the plural microphones based on the delay time detected by the
delay time detector.

The inclination detector of the sound signal processor preferably
divides the phase of the cross-spectrum into a plurality of bands and
detects inclinations of each of the plurality of bands, and the delay
time detector detects the delay time based on the inclinations of
each of the plurality of bands detected by the inclination detector.

A sound signal processor for processing a detection target sound
output from a detection target sound source and input into plural
microphones according to another preferred embodiment of the present
invention includes a cross-spectrum phase detector for detecting a
phase of a cross-spectrum between sound signals that are input into
the plural microphones, an inclination detector for detecting an
inclination of the phase of the cross-spectrum detected by the
cross-spectrum phase detector with respect to a frequency, a delay
time detector for detecting a delay time in the sound signals input
into the plural microphones based on the inclination with respect to
the frequency detected by the inclination detector, a sound signal
synthesizer for synthesizing the sound signals that are input into
the plural microphones based on the delay time detected by the delay
time detector, and a target sound detector for determining whether
the synthesized sound signals synthesized by the sound signal
synthesizer include a target sound based on the inclination with
respect to the frequency detected by the inclination detector.

The inclination detector of the sound signal processor preferably
divides the phase of the cross-spectrum into a plurality of bands and
detects inclinations of each of the plurality of bands, the delay
time detector preferably detects the delay time based on the
inclinations of each of the plurality of bands detected by the
inclination detector, and the target sound detector preferably
detects the target sound based on the inclinations of each of the
plurality of bands detected by the inclination detector.

A voice recognition device for processing a speech sound output
from a speech sound source and input into plural microphones
according to another preferred embodiment of the present invention
includes a cross-spectrum phase detector for detecting a phase of a
cross-spectrum between sound signals that are input into the plural
microphones, an inclination detector for detecting an inclination of
the phase of the cross-spectrum detected by the cross-spectrum phase
detector with respect to a frequency, a speech sound detector for
detecting whether the sound signals input into the plural microphones
include the speech sound based on the inclination with respect to the
frequency detected by the inclination detector, and a voice
recognition processor for performing voice recognition processing of
the speech sound detected by the speech sound detector.

The inclination detector of the voice recognition device preferably
divides the frequency of the phase of the cross-spectrum into a
plurality of bands and detects inclinations of each of the plurality
of bands, and the speech sound detector preferably detects the speech
sound based on the inclinations of each of the plurality of bands
detected by the inclination detector.

A voice recognition device for processing a speech sound output
from a speech sound source and input into plural microphones
according to another preferred embodiment of the present invention
includes a cross-spectrum phase detector for detecting a phase of a
cross-spectrum between sound signals input into the plural
microphones, an inclination detector for detecting an inclination of
the phase of the cross-spectrum detected by the cross-spectrum phase
detector with respect to a frequency, a delay time detector for
detecting a delay time in the sound signals input into the plural
microphones based on the inclination with respect to the frequency
detected by the inclination detector, a sound signal synthesizer for
synthesizing the sound signals input into the plural microphones
based on the delay time detected by the delay time detector, a speech
sound detector for detecting whether the synthesized sound signals
synthesized by the sound signal synthesizer include the speech sound
based on the inclination with respect to the frequency detected by
the inclination detector, and a voice recognition processor for
performing voice recognition processing of the speech sound detected
by the speech sound detector.

The inclination detector of the voice recognition device preferably
divides the phase of the cross-spectrum into a plurality of bands and
detects inclinations of each of the plurality of bands, the delay
time detector detects the delay time based on the inclinations of
each of the plurality of bands detected by the inclination detector,
and the speech sound detector detects the speech sound based on the
inclinations of each of the plurality of bands detected by the
inclination detector.

A program according to another preferred embodiment of the present
invention enables a computer to perform a process of detecting a
target sound. The process includes the steps of inputting detection
target sounds output from a detection target sound source into plural
microphones, detecting a phase of a cross-spectrum between sound
signals input into the plural microphones, detecting an inclination
of the phase of the cross-spectrum with respect to a frequency due to
respective distances from the detection target sound source to the
plural microphones, and, based on the inclination, determining
whether the sound input into the plural microphones includes the
target sound.

A program according to another preferred embodiment of the present
invention enables a computer to perform a process of detecting a
delay time in signal input. The process includes the steps of
inputting sounds output from a sound source into plural microphones,
detecting a phase of a cross-spectrum between sound signals input
into the plural microphones, detecting an inclination of the phase of
the cross-spectrum with respect to the frequency due to respective
distances from the sound source to the plural microphones, and, based
on the inclination, determining a delay time in signals input into
the plural microphones.

When the phase of a cross-spectrum of plural sound signals picked
up by plural microphones is examined, the inclination of the phase
with respect to the frequency is constant, depending on the
difference between the respective distances from the sound source to
the microphones. The difference between the respective distances from
the sound source to the microphones appears as a delay time in sound
reception between the plural microphones. As the S/N ratio of the
sound picked up by the plural microphones increases, the tendency
toward a constant inclination becomes more pronounced. Various
preferred embodiments of the present invention preferably utilize
this relationship.

That is, in various preferred embodiments of the present invention,
the phase of a cross-spectrum between sound signals input into plural
microphones is detected, the inclination of the phase of the
cross-spectrum with respect to the frequency due to the respective
distances from the sound source to the plural microphones is
detected, and, based on the detected inclination, it is determined
whether a detection target sound or speech sound has been received by
the plural microphones. The detection target sound may include
ambient sound produced by substances, in addition to speech sound
produced by humans.

Various preferred embodiments of the present invention operate
based on the principle that, when the phase of a cross-spectrum of
plural sound signals input into plural microphones is examined, the
inclination of the phase with respect to the frequency is constant,
depending on the difference between the distances from the sound
source to the microphones, and that the tendency toward such a
constant inclination becomes more pronounced as the S/N ratio of the
sound picked up by the plural microphones increases.

In addition, in various preferred embodiments of the present
invention, the phase of a cross-spectrum between sound signals input
into plural microphones is detected, the inclination of the phase of
the cross-spectrum with respect to the frequency due to the
respective distances from the sound source to the plural microphones
is detected, and, based on the inclination, a delay time in reception
of sound or sound signals between the plural microphones is detected.

Various preferred embodiments of the present invention operate
based on the principle that, when the phase of a cross-spectrum of
plural sound signals input into plural microphones is examined, the
inclination of the phase with respect to the frequency is constant,
depending on the difference between the respective distances from the
sound source to the microphones, and that the difference between the
respective distances from the sound source to the microphones appears
as a delay time in sound reception between the plural microphones.
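The principle stated above can be written compactly as follows. This is an illustrative sketch only, assuming a single source, free-field propagation at the speed of sound $c$, and distances $d_1$ and $d_2$ from the source to the first and second microphones. The symbols $d_1$, $d_2$, $c$, and $\varphi$ are introduced here for illustration, $G_{12}(\omega)$ denotes the cross-spectrum of the two microphone signals, and $t_0$ denotes the delay time between them.

```latex
t_0 = \frac{d_2 - d_1}{c}, \qquad
\varphi(\omega) = \arg G_{12}(\omega) = \omega\, t_0, \qquad
\frac{d\varphi}{d\omega} = t_0
```

Under these assumptions the phase is a straight line through the origin whose inclination equals the delay time, so a measured inclination can be read directly as a delay.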

In various preferred embodiments of the present invention, the
frequency of the phase of a cross-spectrum is divided into a
plurality of bands, and the processing is performed based on the
inclinations of each of the plurality of divided bands. This provides
detection of the inclinations with high accuracy.

Other features, elements, steps, characteristics and advantages
of the present invention will become more apparent from the following
detailed description of preferred embodiments with reference to the
attached drawings.



BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the entire construction of a
system including a sound signal processor of a preferred embodiment
of the present invention.

FIG. 2 is a block diagram showing the construction of a sound
signal processor of a first preferred embodiment of the present
invention.

FIG. 3 is a property diagram showing the phase of a cross-spectrum
in respective environments.

FIG. 4 is a property diagram showing the phase of a cross-spectrum,
in which (A) is a property diagram showing the phase of a
cross-spectrum of a voiced frame and (B) is a property diagram
showing the phase of a cross-spectrum of a voiceless frame.

FIG. 5 is a property diagram showing a histogram obtained based on
the phase of a cross-spectrum, in which (A) is a property diagram
showing a histogram of a voiced frame and (B) is a property diagram
showing a histogram of a voiceless frame.

FIG. 6 is a block diagram showing the construction of a histogram
etc. calculating section and the like of the sound signal processor.

FIG. 7 is a property diagram used for describing the effects of
the sound signal processor of the first preferred embodiment of the
present invention.

FIG. 8 is a block diagram showing the construction of a sound
signal processor of a second preferred embodiment of the present
invention.

FIG. 9 is a diagram used for describing the Overlap-add method for
generating synthesized signals.

FIG. 10 is a property diagram used for describing the effects of
the sound signal processor of the second preferred embodiment of the
present invention.

FIG. 11 is a block diagram showing the construction of a sound
signal processor of a third preferred embodiment of the present
invention.

FIG. 12 is a block diagram showing another construction of a
voiced/voiceless determining section of the sound signal processor.



DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

A preferred embodiment of the present invention is described below
in detail with reference to the drawings. As shown in FIG. 1, this
preferred embodiment is a sound signal processor 10 for processing
sound signals picked up by two microphones 1 and 2. The first and
second microphones 1 and 2 are preferably of a mountable type that
can be mounted to a sound source (user) with a comparatively high
degree of freedom in their mounting locations.

FIG. 2 shows the construction of the sound signal processor 10 of
a first preferred embodiment. As shown in FIG. 2, the sound signal
processor 10 includes first and second framing sections 11 and 12,
first and second frequency analyzing sections 13 and 14, a
cross-spectrum calculating section 15, a phase extraction processing
section 16, a phase unwrap processing section 17, a main calculating
section 30, and a sound input on/off control section 18. The main
calculating section 30 includes a frequency band dividing section 31,
first through N-th inclination calculating sections 32_1 through
32_N, a histogram etc. calculating section 33, and a voiced/voiceless
determining section 34. The processing operation of each section is
described below.

Two-channel sound signals input from the first and second
microphones 1 and 2 are input into the first and second framing
sections 11 and 12, respectively. The sound signals input from the
first microphone 1 are also input into the sound input on/off control
section 18.

The first and second framing sections 11 and 12, the first and
second frequency analyzing sections 13 and 14, and the cross-spectrum
calculating section 15 calculate a cross-spectrum of the two-channel
sound signals input from the first and second microphones 1 and 2.

For example, when sound signals picked up by plural microphones,
such as the first and second microphones 1 and 2, are observed in a
time series, there is a phase difference between the received sound
signals. This results from the difference between the arrival times
of the sound signals from the sound source to the microphones 1 and
2 due to the difference between the distances from the sound source
to the microphones 1 and 2.

Here, a case is examined in which the delay time between the sound
signals picked up by the first and second microphones 1 and 2 is
measured, the phases of those signals are synchronized based on the
measured delay time, and then the sound signals picked up by the
first and second microphones 1 and 2 are added to obtain synchronized
added sound. Such a technique for obtaining synchronized added sound
is disclosed in, for example, "Acoustic Event Localization Using A
Crosspower-Spectrum Phase Based Technique," by M. Omologo, P. Svaizer
et al., Proc. ICASSP94, pp. 274-276 (1994).

The sound signals picked up by the first and second microphones 1
and 2 are represented as $x_1(t)$ and $x_2(t)$, respectively, and
frequency functions obtained by Fourier transformations of these
sound signals are represented as $X_1(\omega)$ and $X_2(\omega)$,
respectively. The sound signal $x_2(t)$ is assumed to be a
time-shifted waveform of the sound signal $x_1(t)$ as represented by
the following equation (1):

$$x_2(t) = x_1(t - t_0) \qquad (1)$$

On this assumption, the relationship between the frequency
functions $X_1(\omega)$ and $X_2(\omega)$ can be represented by the
following equation (2):

$$X_2(\omega) = e^{-j\omega t_0} X_1(\omega) \qquad (2)$$

Then, from the frequency functions $X_1(\omega)$ and $X_2(\omega)$,
a cross-spectrum $G_{12}(\omega)$ can be obtained as represented by
the following equation (3):

$$G_{12}(\omega) = X_1(\omega) X_2^*(\omega) = X_1(\omega)\, e^{j\omega t_0} X_1^*(\omega) = |X_1(\omega)|^2 e^{j\omega t_0} \qquad (3)$$

The exponent term of the cross-spectrum $G_{12}(\omega)$
corresponds to the time delay between the channels in the spectrum
region. Thus, $X_2(\omega) e^{j\omega t_0}$, obtained by multiplying
the frequency function $X_2(\omega)$ by the delay term
$e^{j\omega t_0}$, is synchronized with the frequency function
$X_1(\omega)$, whereby the inverse Fourier transform of
$X_1(\omega) + X_2(\omega) e^{j\omega t_0}$ can be used as
channel-synchronized-added sound.
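The synchronized addition described above can be checked numerically with a short Python sketch. It rests on assumptions that are not part of the original text: the helper names `dft`, `idft`, and `synchronized_sum` are hypothetical, the delay is a known integer number of samples applied circularly, and the signals are noise-free. Under those assumptions, multiplying the second spectrum by the delay term and adding it to the first should reproduce twice the first signal.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def synchronized_sum(x1, x2, delay):
    """Inverse transform of X1 + X2 * exp(j*omega*t0), with t0 given in samples."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    aligned = [X1[k] + X2[k] * cmath.exp(2j * math.pi * k * delay / n)
               for k in range(n)]
    return [v.real for v in idft(aligned)]


n, delay = 32, 3
x1 = [math.sin(2 * math.pi * 2 * t / n) + 0.5 * math.cos(2 * math.pi * 5 * t / n)
      for t in range(n)]
x2 = [x1[(t - delay) % n] for t in range(n)]   # channel 2 lags channel 1
y = synchronized_sum(x1, x2, delay)
print(max(abs(y[t] - 2 * x1[t]) for t in range(n)))  # residual is close to zero
```

With real recordings the two channels also carry uncorrelated noise, which is why adding them after alignment is expected to emphasize the common source.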

The cross-spectrum $G_{12}(\omega)$ described above is obtained by
the cross-spectrum calculating section 15.

To this end, the first framing section 11 performs framing of the
sound signals input from the first microphone 1 (that is, divides
them into frames), in preparation for the first frequency analyzing
section 13, and outputs the results to the first frequency analyzing
section 13. Also, the second framing section 12 performs framing of
the sound signals input from the second microphone 2 (that is,
divides them into frames), in preparation for the second frequency
analyzing section 14, and outputs the results to the second frequency
analyzing section 14. The first and second framing sections 11 and 12
progressively divide the input sound signals into frames, with each
frame including a predetermined number of samples.
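The framing step can be sketched in Python as follows. The function name `frame_signal` and its `hop` parameter are hypothetical. The text only states that each frame includes a predetermined number of samples, so whether consecutive frames overlap is left here as a caller's choice rather than taken from the embodiment.

```python
def frame_signal(samples, frame_len, hop=None):
    """Split a sample sequence into frames of frame_len samples each.

    hop is the start-to-start spacing of frames; hop == frame_len gives
    non-overlapping frames. A trailing remainder shorter than frame_len
    is dropped.
    """
    if hop is None:
        hop = frame_len
    return [list(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, hop)]


print(frame_signal(list(range(10)), 4))      # two non-overlapping frames
print(frame_signal(list(range(10)), 4, 2))   # four half-overlapping frames
```

Applying the same `frame_len` and `hop` to both channels keeps the frames of the first and second microphones aligned in time, which the later per-frame cross-spectrum calculation relies on.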

For example, when no voice (speech) is input into the microphones
1 and 2, voiceless frames carrying no voice are generated, and when
voice is input into the microphones 1 and 2, voiced frames carrying
voice (speech) are generated.

The first frequency analyzing section 13 performs Fourier
transformations of the sound signals from the first framing section
11 to calculate the frequency function $X_1(\omega)$, and outputs it
to the cross-spectrum calculating section 15. The second frequency
analyzing section 14 performs Fourier transformations of the sound
signals from the second framing section 12 to calculate the frequency
function $X_2(\omega)$, and outputs it to the cross-spectrum
calculating section 15. The first and second frequency analyzing
sections 13 and 14 perform a Fourier transformation for each frame of
the sound signals.

The cross-spectrum calculating section 15 calculates the
cross-spectrum $G_{12}(\omega)$ based on the frequency functions
$X_1(\omega)$ and $X_2(\omega)$ obtained from the first and second
frequency analyzing sections 13 and 14, using equation (3).
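For a single pair of frames, the calculation of equation (3) reduces to a per-bin product of one spectrum with the complex conjugate of the other. The Python sketch below shows this under illustrative assumptions: `dft` and `cross_spectrum` are hypothetical helper names, and the two test frames are a unit impulse and the same impulse one sample later.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of one frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def cross_spectrum(frame1, frame2):
    """Per-bin product X1[k] * conj(X2[k]) for one pair of frames."""
    X1, X2 = dft(frame1), dft(frame2)
    return [a * b.conjugate() for a, b in zip(X1, X2)]


# A unit impulse and the same impulse delayed by one sample.
frame1 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
frame2 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
G = cross_spectrum(frame1, frame2)
print([round(abs(g), 6) for g in G[:4]])          # magnitudes |X1|^2
print([round(cmath.phase(g), 4) for g in G[:4]])  # phase grows linearly with the bin
```

For this one-sample delay and a frame of eight samples, the phase advances by pi/4 per frequency bin while the magnitude stays at one.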

FIG. 3 shows examples of the phase of a cross-spectrum of sound
signals for one frame. In FIG. 3, (A) shows the phase of a
cross-spectrum obtained from sound produced in a car, (B) shows the
phase of a cross-spectrum obtained from sound produced in an office
space, (C) shows the phase of a cross-spectrum obtained from sound
produced in a soundproof room, and (D) shows the phase of a
cross-spectrum obtained from sound produced on a sidewalk (outdoors).
As shown in FIG. 3, the phase component of the cross-spectrum
exhibits a generally constant inclination with respect to the
frequency within a frame, depending on the difference between the
distances from the sound source to the first and second microphones
1 and 2.

When the S/N ratio of the sound signals picked up by the first and
second microphones 1 and 2 is increased, the tendency toward a
constant inclination is increased. Since the first and second
microphones 1 and 2 are preferably of a mountable type, the S/N ratio
of the sound signals picked up by the first and second microphones 1
and 2 is high. Thus, each of the phases of the cross-spectra clearly
exhibits a constant inclination.

The cross-spectrum calculating section 15 outputs a cross-spectrum
G_12(ω) with such properties to the phase extracting section 16.

The phase extracting section 16 extracts (detects) the phase of the
cross-spectrum G_12(ω) obtained from the cross-spectrum calculating
section 15, and outputs the results of the extraction to the phase
unwrap processing section 17.

The phase unwrap processing section 17 unwraps the phase of the
cross-spectrum G_12(ω) based on the results of the phase extraction in
the phase extracting section 16, and outputs the results of the
unwrapping to the frequency band dividing section 31 of the main
calculating section 30.
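
As a hedged sketch only, the chain from the Fourier transformation through
the cross-spectrum to the unwrapped phase may be illustrated with NumPy as
follows. Equation (3) is not reproduced in this passage, so the convention
G_12 = X_1 · conj(X_2) used here is an assumption, as are the function name
cross_spectrum_phase, the synthetic noise source, and the use of a circular
shift (np.roll) to stand in for the inter-microphone delay.

```python
import numpy as np


def cross_spectrum_phase(frame1, frame2):
    """Return the unwrapped phase of the cross-spectrum of two frames.

    Assumes G12 = X1 * conj(X2); this sign convention is an assumption
    of the sketch, not taken from the description.
    """
    x1 = np.fft.rfft(frame1)
    x2 = np.fft.rfft(frame2)
    g12 = x1 * np.conj(x2)
    phase = np.angle(g12)      # wrapped phase in (-pi, pi]
    return np.unwrap(phase)    # remove the 2*pi discontinuities


rng = np.random.default_rng(0)
n_fft, delay = 256, 5
source = rng.standard_normal(n_fft)
frame1 = source
frame2 = np.roll(source, delay)   # second channel: circular delay of 5 samples

phase = cross_spectrum_phase(frame1, frame2)
# A pure delay gives a straight-line phase; fit its slope in rad/point.
slope = np.polyfit(np.arange(phase.size), phase, 1)[0]
```

Under these assumptions the fitted slope equals 2π · delay / n_fft, which is
the constant inclination with respect to frequency discussed above.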

The frequency band dividing section 31 outputs segments obtained by
dividing the phase according to the band to the first through N-th
inclination calculating sections 32_1 through 32_N, respectively.

Note that there is a significant difference in the phase components of
a cross-spectrum between voiceless frames carrying no voice and voiced
frames carrying voice. That is, the phase of a cross-spectrum has a
generally constant inclination with respect to the frequency in voiced
frames, and does not in voiceless frames. A description is made with
reference to FIG. 4.

FIG. 4 shows examples of the phase of a cross-spectrum (CRS). In
FIG. 4, (A) shows the phase of a cross-spectrum of a voiced frame, and
(B) shows the phase of a cross-spectrum of a voiceless frame.

As can be seen from a comparison of FIG. 4(A) and FIG. 4(B), the phase
of a cross-spectrum in voiceless frames has no specific trend with
respect to the frequency. In other words, the phase of a
cross-spectrum does not have a constant inclination with respect to
the frequency. This is because the noise has a random phase.

On the other hand, the phase of a cross-spectrum in voiced frames has
a constant inclination with respect to the frequency. This inclination
depends on the difference between the distances from the sound source
to the microphones 1 and 2.

As described above, there is a significant difference in the phase
components of a cross-spectrum between voiceless frames carrying no
voice and voiced frames carrying voice.

In view of the above, the frequency band dividing section 31 divides
the phase components into small frequency segments (or divides them
according to the band), and the first through N-th inclination
calculating sections 32_1 through 32_N calculate the inclination of
each segment by applying the least squares method, so as to follow the
trend correctly even when the phase is rotated. The first through N-th
inclination calculating sections 32_1 through 32_N respectively output
the calculated inclinations to the histogram calculating section 33.

The method for obtaining the inclination of each segment by applying
the least squares method is a known technique disclosed, for example,
in "Introduction to Signal Processing and Image Processing," by
Nobukatsu Takai, Kougakusha (2000).
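
The band division and per-segment least-squares fit may be sketched as
follows. This is a minimal illustration under stated assumptions: the
segments are taken to be of equal width, and the names segment_slopes and
num_bands are hypothetical and do not appear in the description.

```python
def segment_slopes(phase, num_bands):
    """Split an unwrapped phase sequence into num_bands equal segments and
    return the least-squares slope (rad/point) of each segment."""
    seg_len = len(phase) // num_bands
    slopes = []
    for b in range(num_bands):
        ys = phase[b * seg_len:(b + 1) * seg_len]
        xs = range(len(ys))
        mean_x = sum(xs) / len(ys)
        mean_y = sum(ys) / len(ys)
        # Ordinary least squares: slope = S_xy / S_xx.
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slopes.append(sxy / sxx)
    return slopes


phase = [0.1 * k for k in range(64)]   # ideal voiced-like phase, 0.1 rad/point
slopes = segment_slopes(phase, num_bands=8)
```

Because each segment is fitted separately, a constant offset within a
segment does not affect that segment's slope.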

The histogram calculating section 33 obtains a histogram based on the
inclinations calculated by the first through N-th inclination
calculating sections 32_1 through 32_N.

FIG. 5 shows histograms obtained by the histogram calculating section
33, with each histogram showing the inclinations of the segments. In
other words, FIG. 5 shows the distribution of inclinations of the
phase, with the vertical axis representing the ratio, or incidence, of
the segments of each inclination to all the segments. In FIG. 5, (A)
shows a histogram of a voiced frame, and (B) shows a histogram of a
voiceless frame.

As can be seen from a comparison of FIG. 5(A) and FIG. 5(B), in voiced
frames, the histogram obviously has a peak. That is, the inclinations
are localized within a significantly narrow range, with a high
incidence of inclinations of a specific range. In other words, there
is a strong tendency of the inclinations of each band to be
concentrated on a specific inclination. On the other hand, in
voiceless frames, the histogram has a smooth shape, with the
inclinations distributed over a wider range.

The histogram calculating section 33 outputs the incidences obtained
by creating these histograms to the voiced/voiceless determining
section 34. A specific example of the processing performed by the
histogram calculating section 33 will be described below.

The voiced/voiceless determining section 34 determines voiced and
voiceless sections based on the incidences obtained from the histogram
calculating section 33. For example, a section is determined to be a
voiced section when the incidence of inclinations included within a
predetermined range around the mean value of the incidences is not
less than a predetermined threshold, whereas a section is determined
to be a voiceless section when that incidence is less than the
predetermined threshold.

Here, a frame is determined to be a voiced frame or a voiceless frame,
since the processing at the previous step was performed frame by
frame. The voiced/voiceless determining section 34 outputs the
determination results to the sound input on/off control section 18.

The sound input on/off control section 18 receives the sound signals
from the first microphone 1, and switches on and off these sound
signals to be output to the next step based on the determination
results of the voiced/voiceless determining section 34. Specifically,
when the voiced/voiceless determining section 34 determines sound
signals to be a voiced section, the sound input on/off control section
18 switches on so as to output the sound signals to the next step.
When the voiced/voiceless determining section 34 determines sound
signals to be a voiceless section, the sound input on/off control
section 18 switches off so as not to output the sound signals to the
next step.

Here, the sound input on/off control section 18 switches on and off,
as a unit, the portion of the sound signals from the first microphone
1 corresponding to the frame on which the determination was made,
since the processing at the previous step was performed frame by
frame.
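
The frame-by-frame on/off switching may be pictured with the short sketch
below. The function name gate_frames and the representation of the
determination results as a list of booleans are assumptions made for
illustration only.

```python
def gate_frames(frames, voiced_flags):
    """Pass on only the frames whose voiced/voiceless flag is True."""
    return [frame for frame, voiced in zip(frames, voiced_flags) if voiced]


frames = [[0.0, 0.1], [0.7, -0.6], [0.5, 0.4], [0.0, -0.1]]
flags = [False, True, True, False]    # per-frame determination results
passed = gate_frames(frames, flags)
```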

A specific example of the processing performed by the histogram
calculating section 33 is described. FIG. 6 shows the construction of
the histogram calculating section 33 for performing the processing.

The histogram calculating section 33 preferably includes a first
switch 33S1, a second switch 33S2, and a mode calculating section 33C
for calculating an inclination of a high incidence (modal inclination)
from the inclinations calculated by the first through N-th inclination
calculating sections 32_1 through 32_N. The histogram calculating
section 33 switches on (closes) the first switch 33S1 for a given
period, to create data (or a database) 33D1 of inclinations for the
given period calculated by the first through N-th inclination
calculating sections 32_1 through 32_N. Note that the second switch
33S2 is kept off (open) at this time. When the data 33D1 are created,
the second switch 33S2 is switched on (closed) so as to output the
data 33D1 to the mode calculating section 33C.

The mode calculating section 33C creates a histogram representing the
inclinations as shown in FIG. 5 from the data 33D1, and calculates the
inclination of the highest incidence (hereinafter referred to as modal
inclination) τ_0 in the histogram. Instead of calculating the
inclination of the highest incidence, it is also possible to calculate
the mean value of the inclinations as τ_0, or an inclination τ_0 as a
combination of the inclination of the highest incidence and the mean
value of the inclinations. Thus, when there is a strong tendency of
the inclinations of each band to be concentrated on a specific
inclination, the exact value, or an approximate value, of the specific
inclination is obtained. In this preferred embodiment, the mode
calculating section 33C calculates the modal inclination τ_0.

Then, the mode calculating section 33C outputs the calculated modal
inclination τ_0 to the voiced/voiceless determining section 34. The
modal inclination τ_0 is output to the voiced/voiceless determining
section 34 as data 33D2.
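
One possible way of obtaining the modal inclination and its incidence from
a set of per-band inclinations is sketched below. The fixed bin width, the
use of the bin centre as the reported mode, and the names modal_inclination
and bin_width are assumptions of this sketch; the description does not
specify how the histogram bins are chosen.

```python
def modal_inclination(slopes, bin_width):
    """Return (mode, incidence): the centre of the most populated histogram
    bin, and that bin's share of all the slopes."""
    counts = {}
    for s in slopes:
        idx = int(s // bin_width)          # floor division gives the bin index
        counts[idx] = counts.get(idx, 0) + 1
    best = max(counts, key=counts.get)
    mode = (best + 0.5) * bin_width
    incidence = counts[best] / len(slopes)
    return mode, incidence


# Hypothetical per-band inclinations of a voiced frame (rad/point).
voiced_slopes = [0.12, 0.13, 0.11, 0.12, 0.14, 0.12, -0.41, 0.93]
mode, incidence = modal_inclination(voiced_slopes, bin_width=0.05)
```

In this example six of the eight inclinations fall into the same bin, which
corresponds to the sharp peak described for voiced frames.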

The foregoing is one specific example of the processing performed by
the histogram calculating section 33 but is in no way limiting
thereof.

The voiced/voiceless determining section 34 determines voiced and
voiceless sections based on the modal inclination τ_0 from the
histogram calculating section 33.

In the preceding description, the voiced/voiceless determining section
34 determines voiced and voiceless sections based on the incidences
obtained from the histogram calculating section 33. In this specific
example, the voiced/voiceless determining section 34 determines voiced
and voiceless sections based on the modal inclination τ_0 obtained
from the histogram calculating section 33 and the inclinations (of
each band) τ_i calculated by the first through N-th inclination
calculating sections 32_1 through 32_N. Therefore, the
voiced/voiceless determining section 34 is adapted to receive the
inclinations calculated by the first through N-th inclination
calculating sections 32_1 through 32_N.

The voiced/voiceless determining section 34 compares the inclinations
τ_i calculated by the first through N-th inclination calculating
sections 32_1 through 32_N and the modal inclination τ_0, using the
following inequality (4):

\[ |\tau_i - \tau_0| < \delta \qquad (4) \]

wherein δ represents a threshold used for the determination
(inclination threshold).

The voiced/voiceless determining section 34 determines a section to be
a voiced section when the inequality (4) is satisfied by more than a
predetermined ratio of the inclinations (YES), and determines a
section to be a voiceless section when the predetermined ratio is not
reached (NO). Then, the voiced/voiceless determining section 34
outputs the determination results to the sound input on/off control
section 18.
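
The determination based on inequality (4) may be sketched as follows. The
function name is_voiced, the parameter min_ratio for the predetermined
ratio, and all numerical values are hypothetical and serve only to
illustrate the comparison.

```python
def is_voiced(slopes, tau0, delta, min_ratio):
    """Return True when more than min_ratio of the per-band inclinations
    tau_i satisfy |tau_i - tau0| < delta (inequality (4))."""
    hits = sum(1 for tau_i in slopes if abs(tau_i - tau0) < delta)
    return hits / len(slopes) > min_ratio


# Hypothetical per-band inclinations (rad/point).
voiced_slopes = [0.12, 0.13, 0.11, 0.12, 0.14, 0.12, -0.41, 0.93]
noise_slopes = [-0.8, 0.3, 1.1, -0.2, 0.6, -1.3, 0.05, 0.9]

voiced_result = is_voiced(voiced_slopes, tau0=0.125, delta=0.05, min_ratio=0.5)
noise_result = is_voiced(noise_slopes, tau0=0.3, delta=0.05, min_ratio=0.5)
```

Because the comparison is made against the modal inclination of the same
frame, the decision does not depend on the absolute size of the inclination.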

The sound signal processor 10 constructed as described above functions
as follows.

First of all, the first and second framing sections 11 and 12, the
first and second frequency analyzing sections 13 and 14, and the
cross-spectrum calculating section 15 calculate a cross-spectrum
G_12(ω) of two-channel sound signals input from the first and second
microphones 1 and 2.

Then, the phase extracting section 16, the phase unwrap processing
section 17, and the frequency band dividing section 31 divide the
phase of the thus calculated cross-spectrum G_12(ω) according to the
band (divided into segments), and the first through N-th inclination
calculating sections 32_1 through 32_N calculate the inclinations of
the phase of each band (each segment).

Then, the histogram calculating section 33 generates a histogram based
on the inclinations of each band (each segment) calculated
respectively by the first through N-th inclination calculating
sections 32_1 through 32_N, and the voiced/voiceless determining
section 34 determines voiced and voiceless sections based on the
incidences and the modal inclination τ_0 obtained from the histogram.
Based on the determination results, the sound input on/off control
section 18 switches on and off the sound signals from the first and
second microphones 1 and 2 to be output to the next step.
Specifically, when the voiced/voiceless determining section 34
determines sound signals to be a voiced section, the sound input
on/off control section 18 switches on to output the sound signals to
the next step. When the voiced/voiceless determining section 34
determines sound signals to be a voiceless section, the sound input
on/off control section 18 switches off so as not to output the sound
signals to the next step.

In this manner, the sound signal processor 10 detects speech sections
(voiced sections) contained in the sound picked up by the first and
second microphones 1 and 2.

Implementation of such a sound signal processor between the first and
second microphones 1 and 2 and a voice application, for example,
enables the voice application to securely perform processing related
to speech sections. Examples of the voice application include a voice
recognition system, a broadcasting system, a cellular phone, and a
transceiver. For example, when the voice application is a voice
recognition system, the voice recognition system performs voice
recognition based on the sound signals contained in speech sections
that are output by the sound signal processor 10.
The effects are described below.

As described previously, the phase of a cross-spectrum between the
sound signals input into the first and second microphones 1 and 2 is
detected, and speech sections contained in the sound signals picked up
by the plural microphones are detected based on the inclination of the
detected phase of the cross-spectrum with respect to the frequency. In
other words, speech sections contained in the sound signals picked up
by the plural microphones are detected utilizing the significant
difference in the phase components of a cross-spectrum generated from
sound signals containing no voice (speech) and sound signals
containing voice (speech).

Specifically, the phase of the cross-spectrum is divided according to
the band (divided into segments), a histogram is generated based on
the inclinations of the phase of each band (each segment), an
incidence (specifically, the mode) is obtained from the histogram, and
speech sections are detected based on the incidence.

This enables accurate detection of speech sections. Further, utilizing
such sound signals contained in the speech sections detected by the
sound signal processor 10 enables voice recognition with a high
recognition rate/low misrecognition rate in a voice recognition
system, hands-free, half-duplex operation with high reliability in a
cellular phone and a transceiver, and reduction of the power
consumption of the communication system in a broadcasting system.

Even in the case of varying environmental conditions, such as a change
in the mounting locations of the microphones, and movement of the
sound source, such as movement or a change in posture of the speaker,
outstanding voice input is achieved.

As described previously, the inclination of the phase of a
cross-spectrum with respect to the frequency depends on the difference
between the distances from the sound source to the first and second
microphones 1 and 2. Thus, when the mounting locations of the first
and second microphones 1 and 2 relative to the sound source are
changed, for example, the inclination of the phase of the
cross-spectrum with respect to the frequency is also changed in
response to the changes in the locations. Meanwhile, as described
previously, the phase of the cross-spectrum is divided according to
the band (divided into segments), a histogram is generated based on
the inclinations of the phase of each band (each segment), an
incidence (specifically, the mode) is obtained from the histogram, and
speech sections are detected based on the incidence. In other words,
speech sections are detected irrespective of the magnitude of the
inclination of the phase of the cross-spectrum, or the distances from
the sound source to the microphones 1 and 2. Therefore, even when the
mounting locations of the first and second microphones 1 and 2
relative to the sound source are changed, the detection results of
speech sections are not affected.

As a result, even when environmental changes occur, such as in the
mounting locations of the microphones, and movement of the sound
source, such as movement or a change in posture of the speaker,
outstanding voice input is achieved. In other words, outstanding voice
input is achieved while maintaining a high degree of freedom in the
locations of the microphones.

As described above, the aforementioned various effects are obtained
even when using mountable microphones, which are compact and
lightweight for easy mounting/removing, can ensure short-distance
sound pickup generally similar to close-talking microphones, and can
reduce the burden and discomfort when mounted to the user as compared
to close-talking microphone headsets.

(Example (First Embodiment))

A detection of a speech section containing voice was performed using a
system to which the present preferred embodiment of the present
invention was applied. A total of forty sentences with a voiceless
section of about one second intervening between sentences was used as
sample sound. Experiments were performed in the following
environments: in a soundproof room, in a car, in an office space, and
on a sidewalk. For evaluation, a frame was determined to be an error
frame when (1) a voiceless frame was incorrectly determined to be a
voiced frame, or (2) judging from its leading end and trailing end, a
speech section was determined to be a non-speech section. As a
comparison object (conventional example), a method was used utilizing
a Fisher's linear discriminant function using the average number of
zero-crossings and the logarithmic power as variables.

FIG. 7 shows the results. FIG. 7 shows the percentage of the ratio of
error frames to the total frames (speech section misdetection rate).
In FIG. 7, the values designated as LDF are those obtained by the
method utilizing the linear discriminant function, while the values
designated as CRS are those obtained by the method utilizing the
cross-spectrum (the present invention).

As shown in FIG. 7, in a soundproof room and in an office space, no
substantial difference was observed in the resulting speech section
misdetection rate between the method utilizing the average number of
zero-crossings and the logarithmic power and the method of preferred
embodiments of the present invention. However, in a car and on a
sidewalk, the method according to preferred embodiments of the present
invention produced greatly improved results in the speech section
misdetection rate. Thus, the present invention functions effectively
particularly in noisy environments.

A second preferred embodiment is described hereinafter.

FIG. 8 shows the construction of a sound signal processor 10 according
to the second preferred embodiment. In the second preferred
embodiment, the sound signals picked up by the first and second
microphones 1 and 2 are synthesized to be output to a voice
application. To this end, the second preferred embodiment includes a
delay processing section 51 and a waveform synthesizing section 52.
The delay processing section 51 delays the sound signals from the
second microphone 2 and outputs them to the waveform synthesizing
section 52, and the waveform synthesizing section 52 synthesizes the
sound signals from the first microphone 1 and the sound signals of the
second microphone 2 input from and delayed by the delay processing
section 51 and outputs them.

A phase difference is observed between the sound signals picked up by
plural microphones, such as the first and second microphones 1 and 2,
because of the difference between the distances from the sound source
to the microphones 1 and 2. Therefore, in order to synthesize the
sound signals picked up by plural microphones, such as the first and
second microphones 1 and 2, a delay-and-sum processing is required, in
which the difference between the arrival times of the sound signals
from the sound source to the microphones 1 and 2 is corrected, the
phases of those signals are synchronized, and thereafter the sound
signals are added. Thus, the second preferred embodiment preferably
includes the delay processing section 51 and the waveform synthesizing
section 52 as described above.

In the foregoing first preferred embodiment (see FIG. 6), the mode
calculating section 33C calculates the modal inclination τ_0 from the
histogram. In the second preferred embodiment, the delay processing
section 51 performs delay processing based on the modal inclination
τ_0. A specific description is provided below.

As shown in FIG. 3 and (A) of FIG. 4, the phase components of a
cross-spectrum have a constant inclination in voiced sections. This
inclination indicates the delay time between the channels of the first
and second microphones 1 and 2.

Utilizing this relationship, the delay processing section 51 performs
delay processing based on the modal inclination τ_0 calculated by the
histogram calculating section 33. Specifically, as shown in FIG. 6,
the mode calculating section 33C outputs the modal inclination τ_0 to
the delay processing section 51, and the delay processing section 51
performs delay processing based on the input modal inclination τ_0.

The modal inclination τ_0 is expressed by the following equation (5):

\[ \tau_0 = \tau / n = 2\pi n_0 / N \quad [\mathrm{rad/point}] \qquad (5) \]

wherein the units for τ and n are respectively radian and frequency
point (point), N represents the number of FFT points, and n_0
represents the number of delay sampling points. From this
relationship, the number of delay sampling points n_0 using the modal
inclination τ_0 as a variable can be obtained by the following
equation (6):

\[ n_0 = \tau_0 / (2\pi / N) \quad [\mathrm{point}] \qquad (6) \]

Then, using this number of delay sampling points n_0, the delay time
t_0 is obtained by the following equation (7):

\[ t_0 = n_0 / F_s \qquad (7) \]

wherein F_s represents the sampling frequency, for example, 16 kHz.
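
As a worked illustration of equations (6) and (7), the following sketch
converts a modal inclination into a number of delay sampling points and a
delay time. The function name delay_from_inclination and the FFT size of
512 points are assumptions chosen for the example; the 16 kHz sampling
frequency is the example value given above.

```python
import math


def delay_from_inclination(tau0, n_fft, fs):
    """Convert a modal inclination tau0 (rad/point) into the number of
    delay sampling points n0 and the delay time t0 (seconds)."""
    n0 = tau0 / (2 * math.pi / n_fft)   # equation (6)
    t0 = n0 / fs                        # equation (7)
    return n0, t0


# An inclination corresponding to a five-sample delay at N = 512.
tau0 = 2 * math.pi * 5 / 512
n0, t0 = delay_from_inclination(tau0, n_fft=512, fs=16000)
```

Here n0 comes out as five sampling points, so that t0 is 5/16000 s, that is,
0.3125 ms.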

The delay processing section 51 delays the sound signals input from
the second microphone 2 based on the thus obtained delay time t_0, and
outputs them to the waveform synthesizing section 52.

The waveform synthesizing section 52 synthesizes the sound signals
from the first microphone 1 and the sound signals from the second
microphone 2, which are input from and delayed by the delay processing
section 51, and outputs them.

Synthesized sound signals may also be obtained in the manner described
below.

As previously described, X_2(ω)e^{jωt_0}, obtained by multiplying the
frequency function X_2(ω) by the delay term e^{jωt_0}, is synchronized
with the frequency function X_1(ω), whereby the inverse Fourier
transform of X_1(ω) + X_2(ω)e^{jωt_0} is used as
channel-synchronized-added sound. Utilizing this relationship,
synthesized sound signals are obtained.

That is, first of all, the delay time t_0 is used to obtain the
channel-synchronized-added sound X_1(ω) + X_2(ω)e^{jωt_0} on the
frequency scale by the following equation (8). Note that the delay
time t_0 has the modal inclination τ_0 as a variable as shown in the
equations (6) and (7).

\[
X_1(\omega) + X_2(\omega)e^{j\omega t_0}
= \{\mathrm{Re}[X_1(\omega)] + j\,\mathrm{Im}[X_1(\omega)]\}
+ \{\mathrm{Re}[X_2(\omega)](\cos\omega t_0 + j\sin\omega t_0)
+ j\,\mathrm{Im}[X_2(\omega)](\cos\omega t_0 + j\sin\omega t_0)\}
\qquad (8)
\]

Here, the channel-synchronized-added spectrum is a complex spectrum
composed of a real part and an imaginary part represented respectively
as follows:

\[
\mathrm{Re}:\ \mathrm{Re}[X_2(\omega)]\cos\omega t_0
- \mathrm{Im}[X_2(\omega)]\sin\omega t_0 + \mathrm{Re}[X_1(\omega)]
\]
\[
\mathrm{Im}:\ \mathrm{Re}[X_2(\omega)]\sin\omega t_0
+ \mathrm{Im}[X_2(\omega)]\cos\omega t_0 + \mathrm{Im}[X_1(\omega)]
\]

This processing is performed for each frame and then IFFT 
(inverse FFT) is performed for each frame, to obtain a frame string 
of the synchronized added sound. 
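
The expansion of equation (8) into its real and imaginary parts can be
checked numerically for a single frequency bin, as in the sketch below. The
function name sync_add_bin and the sample values of X_1, X_2, ω and t_0 are
hypothetical; the expanded form is compared against the direct complex
product X_1 + X_2 · exp(jωt_0).

```python
import cmath
import math


def sync_add_bin(x1, x2, omega, t0):
    """Channel-synchronized addition for one frequency bin, written out as
    the real and imaginary parts of X1 + X2 * exp(j * omega * t0)."""
    c = math.cos(omega * t0)
    s = math.sin(omega * t0)
    re = x2.real * c - x2.imag * s + x1.real
    im = x2.real * s + x2.imag * c + x1.imag
    return complex(re, im)


x1 = complex(0.5, -1.0)          # hypothetical value of X1 at this bin
x2 = complex(2.0, 0.25)          # hypothetical value of X2 at this bin
omega = 2 * math.pi * 1000.0     # hypothetical angular frequency (rad/s)
t0 = 3.125e-4                    # hypothetical delay time (s)

expanded = sync_add_bin(x1, x2, omega, t0)
direct = x1 + x2 * cmath.exp(1j * omega * t0)
```

The two results agree to within rounding error, since the expanded form
follows directly from multiplying out the second term of equation (8).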

The Overlap-add method is then applied to the thus obtained frame
string, to obtain synchronized added sound, or synthetic signals of
the sound signals of the first microphone 1 and the sound signals of
the second microphone 2.

The Overlap-add method is a method in which input data strings s_n(t)
are added in overlapping relation as shown in FIG. 9. Here, s_n(t)
represents an n-th synthesized sound waveform frame. The symbol L in
FIG. 9 represents a constant.
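
A minimal sketch of overlap-add is given below. The function name
overlap_add and the parameter hop (the offset between the starts of
successive frames) are assumptions of this sketch; how hop relates to the
constant L of FIG. 9 is not stated in this passage and is not assumed here.

```python
def overlap_add(frames, hop):
    """Sum equal-length frames s_n(t) into one signal, placing frame n at
    sample offset n * hop so that neighbouring frames overlap."""
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for n, frame in enumerate(frames):
        for i, value in enumerate(frame):
            out[n * hop + i] += value
    return out


frames = [[1.0, 1.0, 1.0, 1.0]] * 3    # three identical 4-sample frames
signal = overlap_add(frames, hop=2)
```

In practice each frame is usually weighted by a window before the addition
so that the overlapped regions sum smoothly; no window is applied in this
sketch.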

In the sound signal processor 10 constructed as described above, the
delay processing section 51 delays the sound signals from the second
microphone 2 and outputs them to the waveform synthesizing section 52,
and the waveform synthesizing section 52 synthesizes the sound signals
from the first microphone 1 and the sound signals from the second
microphone 2 input from and delayed by the delay processing section 51
and outputs them.

The effects achieved by this construction are as follows. 

As described in connection with the foregoing first preferred
embodiment, the inclination of the phase of a cross-spectrum with
respect to the frequency changes depending on the difference between
the distances from the sound source to the first and second
microphones 1 and 2. The delay time is estimated from this inclination
of the phase of a cross-spectrum with respect to the frequency. The
value actually used for the estimation is designated as modal
inclination τ_0. The use of the modal inclination τ_0 in estimating
the delay time ensures high accuracy of the estimated delay time.

Further, by synthesizing the sound signals of the first and second
microphones based on the delay time as described above, high-quality
synthesized sound signals are provided. For example, utilizing such
synthesized sound signals, a voice recognition system performs voice
recognition with a high recognition rate/low misrecognition rate, a
cellular phone and a transceiver provide conversations in high-quality
sound, and a broadcasting system provides high-quality broadcasting
and recording.

As in the foregoing first preferred embodiment, the use of the modal
inclination τ_0 in the estimation of the delay time also provides
outstanding voice input, even with environmental changes, such as a
change in the mounting locations of the microphones, and movement of
the sound source, such as movement or a change in posture of the
speaker. In other words, outstanding voice input is achieved while
maintaining a high degree of freedom in the locations of the
microphones.

As described above, the aforementioned various effects are obtained
even when using mountable microphones, which are compact and
lightweight for easy mounting/removing, can ensure short-distance
sound pickup generally similar to close-talking microphones, and can
reduce the burden and discomfort when mounted to the user as compared
to close-talking microphone headsets.

(Example (Second Embodiment))

A voice recognition experiment with acoustic models was conducted
using the synchronized added sound (synthesized sound signals)
generated by a system to which the present preferred embodiment of the
present invention was applied.

In this voice recognition experiment with acoustic models, first of
all, acoustic models were prepared using learning data obtained from
the synchronized added sound. The acoustic models prepared were as
follows:

(1) four collection-environment-dependent HMMs (hidden Markov models)
prepared for each collection environment, and

(2) a collection-environment-independent HMM acquired through learning
using sound from all the collection environments.

The collection environments were the same as above: in a soundproof
room, in a car, in an office space, and on a sidewalk.

Then, a voice recognition experiment was conducted using the prepared
acoustic models.

The recognition task was continuous voice recognition, and the data
for evaluation (sound for evaluation) were different sounds from those
used in the learning. FIG. 10 shows the results of the voice
recognition experiment. The results of the recognition rate with the
mono-channel sound from the first and second microphones 1 and 2 are
also shown as comparison objects (conventional examples). The first
and second microphones 1 and 2 were a glasses microphone and a chest
microphone, respectively, for example. The glasses microphone refers
to a microphone mounted to the frame of glasses.

As shown in FIG. 10, the recognition rate with the synchronized added
sound obtained by the present preferred embodiment of the present
invention exceeded the recognition rate with the mono-channel sound in
a soundproof room, on a sidewalk, and in all the environments, except
in a car. This demonstrated that the synchronized added sound
generated by the system to which the present preferred embodiment of
the present invention was applied was of high quality in actual
environments.

A third preferred embodiment is described hereinafter.
FIG. 11 shows the construction of a sound signal processor 10
according to the third preferred embodiment. The sound signal
processor 10 of the third preferred embodiment is a combination of the
sound signal processors 10 of the foregoing first and second preferred
embodiments. That is, the sound signal processor 10 of the third
preferred embodiment includes a voiced/voiceless determining section
34, a delay processing section 51, a waveform synthesizing section 52,
and a sound input on/off control section 18.

Configured as described above, the sound signal
processor 10 of the third preferred embodiment operates as follows.
Note that those sections not specifically described operate in the
same manner as in the sound signal processors 10 of the foregoing first
and second preferred embodiments of the present invention.

The delay processing section 51 delays the sound signals of the
second microphone 2 based on the modal inclination $\tau_0$ calculated by
the histogram etc. calculating section 33 (mode calculating section
33C). The waveform synthesizing section 52 synthesizes the sound
signals of the second microphone 2 input from and delayed
by the delay processing section 51 and the sound signals from the first
microphone 1, and outputs the synthesized sound signals to the sound
input on/off control section 18.
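
The delay-and-synthesize step can be illustrated with a minimal sketch.
It assumes that the modal inclination has already been converted into a
non-negative whole number of samples and that synthesis is a sample-wise
sum. The function name and both assumptions are illustrative only and
are not taken from the embodiment.

```python
def delay_and_sum(first, second, delay_samples):
    """Delay `second` by `delay_samples` samples and add it to `first`.

    Assumption: the delay is a non-negative integer number of samples and
    the two signals are combined by simple addition.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    # Prepend zeros so that the second signal arrives later.
    delayed = [0.0] * delay_samples + list(second)
    # Only the overlapping part of the two signals is synthesized.
    length = min(len(first), len(delayed))
    return [first[i] + delayed[i] for i in range(length)]
```

For example, a second-microphone signal that leads the first by two
samples is brought into alignment with `delay_samples=2`, so that the
two waveforms add coherently.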

Meanwhile, the voiced/voiceless determining section 34
determines voiced and voiceless sections based on the incidence
obtained by the histogram etc. calculating section 33, and the sound
input on/off control section 18 switches on and off, so as to output
or not to output the sound signals (synchronized added sound signals)
output from the waveform synthesizing section 52, based on the
determination results.

Configured as described above, the sound signal
processor 10 of the third preferred embodiment provides
the effects achieved by the sound signal processors 10 of the foregoing
first and second preferred embodiments.

That is, high-quality synthesized sound signals are
generated, enabling accurate detection of speech sections
contained therein. Further, even with variations in
environmental conditions, such as a change in the
mounting locations of the microphones, and movement of the
sound source, such as movement or a change in posture of the speaker,
outstanding voice input is achieved. In other words,
outstanding voice input is achieved while
maintaining a high degree of freedom in the locations
of the microphones.

The descriptions of the preferred embodiments of the present
invention have been provided above. The application of the
present invention, however, is not limited to the foregoing preferred
embodiments.

For example, as shown in FIG. 12, the voiced/voiceless
determining section 34 compares the inclinations $\tau_i$ calculated by the
first through N-th inclination calculating sections $32_1$ through $32_N$
and the modal inclination $\tau_0$, using the following inequality (9):

$$|\tau_i - \tau_0| < \alpha\sigma \quad (9)$$

wherein $\alpha$ represents a coefficient, and $\sigma$ represents a value physically
included within the threshold used for the determination (inclination
threshold) $\delta$ described previously. For example, the point of
providing $\delta$ and $\alpha\sigma$ is to distinguish the difference between the effects
in detecting voiced sections due to the two values, namely $\delta$ as a
constant and $\alpha\sigma$ as a variable progressively updated through real-time
learning.
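
The comparison in inequality (9) can be sketched as follows. The sketch
only marks, for each band, whether its inclination lies within the
variable threshold around the modal inclination. The function and
parameter names are hypothetical, and how the per-band results are
combined into a voiced/voiceless decision is left open here.

```python
def bands_within_threshold(inclinations, modal_inclination, alpha, sigma):
    """Return one flag per band: True if |tau_i - tau_0| < alpha * sigma.

    `alpha` is the coefficient and `sigma` the learned value of
    inequality (9); both names are illustrative.
    """
    threshold = alpha * sigma
    return [abs(tau_i - modal_inclination) < threshold
            for tau_i in inclinations]
```

Counting the True entries gives the number of bands whose inclination is
concentrated near the modal inclination for the current frame.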

Since $\sigma$ in $\alpha\sigma$ is updatable, the conditions for the determination
of a voiced section may be made more strict to more effectively
prevent incorrect determination of a voiceless section in quiet
environments. Meanwhile, the conditions for the determination may
be made less strict to permit more stable detection of a voiced
section in environments with increased background noise. Assuming
that $\sigma$ adapted for quiet environments is used in environments with
background noise, which case is equivalent to the case when $\delta$ as a
constant is used, there is a concern that a voiced section carrying
overlapped noise and voice may not be identified properly.

In other words, $\delta$ as a constant functions effectively in the
detection of voiced sections when used in environments similar to the
conditions under which that value was set, while $\alpha\sigma$ as a variable
functions effectively in the detection of voiced sections when used
in a system intended to dynamically respond to environmental changes.

The strictness of the determination may be increased and reduced
by changing the coefficient $\alpha$.
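
The description does not state how the variable part of the threshold is
learned in real time. Purely as an illustration, the following sketch
assumes exponential smoothing of the root-mean-square deviation of the
band inclinations from the modal inclination. The update rule, the
`rate` parameter, and the function name are all assumptions and are not
part of the disclosed embodiments.

```python
import math

def update_sigma(sigma, inclinations, modal_inclination, rate=0.1):
    """Move `sigma` toward the spread of the current frame.

    Assumption: the spread is the RMS deviation of the band inclinations
    from the modal inclination, and learning is exponential smoothing.
    """
    count = len(inclinations)
    rms = math.sqrt(
        sum((tau_i - modal_inclination) ** 2 for tau_i in inclinations) / count
    )
    return (1.0 - rate) * sigma + rate * rms
```

Under this assumption, the learned value grows in noisy environments,
which loosens the determination, and shrinks in quiet ones, which
tightens it. Scaling the result by a larger or smaller coefficient
adjusts the strictness further.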

In the foregoing preferred embodiments, the tendency of the
inclinations of each band to be concentrated on a specific
inclination was observed by creating histograms from these
inclinations of each band. However, the tendency of the
inclinations of each band to be concentrated on a specific
inclination may be observed by another method.
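
A histogram-based observation of this tendency can be sketched as
follows. The sketch assumes equal-width bins, takes the centre of the
fullest bin as the modal inclination, and takes the fraction of bands in
that bin as the incidence. The bin width and the function name are
illustrative choices.

```python
import math
from collections import Counter

def modal_inclination(inclinations, bin_width):
    """Return (centre of the fullest bin, fraction of bands in that bin)."""
    # Assign each band inclination to an equal-width histogram bin.
    bins = Counter(math.floor(tau_i / bin_width) for tau_i in inclinations)
    # The first fullest bin encountered is taken as the mode.
    index, count = max(bins.items(), key=lambda item: item[1])
    centre = (index + 0.5) * bin_width
    return centre, count / len(inclinations)
```

A high incidence indicates that the inclinations of many bands agree,
as expected when a single target sound dominates the frame.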

Also, in the descriptions of the foregoing preferred embodiments, 

the detection target sound was speech sound produced by humans. 
However, the detection target sound may be sound produced by 
sources other than humans.

In the descriptions of the foregoing preferred embodiments,
the first and second framing sections 11 and 12, first and second
frequency analyzing sections 13 and 14, and cross-spectrum calculating
section 15 preferably use a cross-spectrum phase
detector for detecting the phase of a cross-spectrum between the
sound signals inputted into plural microphones; the phase extracting
section 16, phase unwrap processing section 17, frequency band
dividing section 31, and first through N-th inclination calculating
sections $32_1$ through $32_N$ use an inclination
detector for detecting the inclination of the phase of the
cross-spectrum detected by the cross-spectrum phase
detector with respect to the frequency; and the histogram etc.
calculating section 33 and voiced/voiceless determining section 34
use a speech sound detector for detecting
a speech section contained in the sound picked up by the plural
microphones based on the inclination with respect to the frequency
detected by the inclination detector.
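
The chain from frequency analysis to per-band inclination can be
sketched for a single pair of frames as follows. The sketch assumes a
plain discrete Fourier transform without windowing, equal-width bands,
and a least-squares line fit for the inclination, measured in radians
per frequency bin. All function names and these choices are
illustrative and may differ from the embodiments.

```python
import cmath
import math

def dft(frame):
    """Discrete Fourier transform, bins 0 .. N/2 (no window applied)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n // 2 + 1)]

def cross_spectrum_phase(frame1, frame2):
    """Phase of X1(k) * conj(X2(k)) for each frequency bin."""
    return [cmath.phase(a * b.conjugate())
            for a, b in zip(dft(frame1), dft(frame2))]

def unwrap(phases):
    """Remove 2*pi jumps so that the phase varies continuously."""
    unwrapped = [phases[0]]
    for phase in phases[1:]:
        step = phase - unwrapped[-1]
        step -= 2.0 * math.pi * round(step / (2.0 * math.pi))
        unwrapped.append(unwrapped[-1] + step)
    return unwrapped

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

def band_inclinations(frame1, frame2, num_bands):
    """Inclination of the unwrapped cross-spectrum phase in each band."""
    phase = unwrap(cross_spectrum_phase(frame1, frame2))
    size = len(phase) // num_bands
    if size < 2:
        raise ValueError("each band needs at least two frequency bins")
    bins = list(range(len(phase)))
    return [fit_slope(bins[b * size:(b + 1) * size],
                      phase[b * size:(b + 1) * size])
            for b in range(num_bands)]
```

With this sign convention, a second frame that lags the first yields a
positive inclination, and for a pure delay every band reports the same
value.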

In addition, the histogram etc. calculating section 33 and delay
processing section 51 use a delay time
detector for detecting the delay time between the sound signals
picked up by the plural microphones based on the inclination with
respect to the frequency detected by the inclination
detector, and the waveform synthesizing section 52
uses a sound signal synthesizer for synthesizing
the sound signals inputted into the plural microphones based on the
delay time detected by the delay time detector.

Further, the sound signal processor 10 of the foregoing preferred 

embodiments may be applied to a voice recognition device. In this 
case, the voice recognition device includes a voice recognition 
processor for performing voice recognition
processing of the sound signals contained in the speech section (speech
sound) detected by the sound signal processor 10, in addition to the
components of the sound signal processor 10 as described above. 

Examples of voice recognition techniques include "VORERO" 

(trademark), a voice recognition technique proposed by Asahi Kasei
Kabushiki Kaisha (see, for example, the following website:
http://www.asahi-kasei.co.jp/vorero/jp/vorero/feature.html). The
present invention may be applied to voice recognition devices using 
such voice recognition techniques. 

Furthermore, the sound signal processor 10 of the foregoing 

preferred embodiments may be provided on a computer. The
processing operation of the sound signal processor 10 as described
above may be performed on a computer with a predetermined program.
In this case, such a program may be designed to make the computer perform
the process of detecting a target sound, the processing
including: inputting detection target sounds outputted from a
detection target sound source into plural microphones; detecting the
phase of a cross-spectrum between the sound signals inputted into the
plural microphones; detecting the inclination of the phase of the
cross-spectrum with respect to the frequency due to the respective
distances from the detection target sound source to the plural
microphones; and, based on the inclination, detecting the target
sound outputted from the detection target sound source and picked up
by the plural microphones.

Alternatively, the program may be designed to make the computer 
perform a process of detecting the delay time in sound input,
the processing including: inputting sounds outputted from a sound
source into plural microphones; detecting the phase of a
cross-spectrum between the sound signals input into the
plural microphones; detecting the inclination of the phase of the
cross-spectrum with respect to the frequency due to the respective
distances from the sound source to the plural microphones; and, based
on the inclination, detecting the delay time in sound reception from
the sound source between the plural microphones.
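
The last step, going from an inclination to a delay time, can be
sketched under stated assumptions. If the inclination is measured in
radians per frequency bin of an N-point discrete Fourier transform, a
delay of d samples produces an inclination of 2*pi*d/N, so the delay
follows by inverting that relation. The function name, the unit of the
inclination, and the sampling rate parameter are illustrative.

```python
import math

def delay_from_inclination(inclination_rad_per_bin, frame_length,
                           sample_rate_hz):
    """Return (delay in samples, delay in seconds).

    Assumption: the inclination is the slope of the cross-spectrum phase
    in radians per bin of a `frame_length`-point transform.
    """
    delay_samples = inclination_rad_per_bin * frame_length / (2.0 * math.pi)
    return delay_samples, delay_samples / sample_rate_hz
```

For a 32-point frame sampled at 16 kHz, an inclination of pi/8 radians
per bin corresponds to a delay of two samples.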

Industrial Applicability 

The present invention provides a sound
reception system which preferably uses mountable microphones
and which efficiently operates even when environmental
fluctuations occur.

While the present invention has been described with respect to 
preferred embodiments, it will be apparent to those skilled in the art 
that the disclosed invention may be modified in numerous ways and may 
assume many embodiments other than those specifically set out and 
described above. Accordingly, it is intended by the appended claims 
to cover all modifications of the invention which fall within the true 
spirit and scope of the invention.